USING THE TWITTER CORPUS

Twitter's Terms of Use prohibit sharing Twitter datasets in full; instead, a dataset must be shared as a list of tweet IDs, which can then be hydrated, i.e. repopulated with the tweet text content. This document gives instructions on how to hydrate the tweet IDs. For more information on the Twitter corpus, please see the MHR Corpus Documentation document.

We recommend using DocNow Hydrator, available for free on GitHub: https://github.com/DocNow/hydrator. Before you can use it, you will need to link your Twitter Developer account on the Hydrator 'Settings' page. This requires your Twitter Developer API access key and API secret key.

Once you have Hydrator set up, you can upload the tweet ID files (twitter_a.txt and twitter_b.txt) one at a time. After each upload, Hydrator will calculate the number of tweet IDs in the file and then prompt you to enter a name for the hydrated file in the 'Title' field. Other fields can be left blank. Click 'Add dataset' and then click 'Start' to hydrate the tweet IDs. A window will appear asking where and in which format to save the hydrated file. The default format is .json, but you can also save to .csv if you prefer.
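
If you would like to check an ID file yourself before uploading it, the count that Hydrator reports can be compared against a simple line count. The following Python sketch is only an illustration and assumes that each ID file holds one numeric tweet ID per line; the function name count_tweet_ids is our own and is not part of Hydrator.

```python
def count_tweet_ids(id_path):
    """Return (valid, invalid) entry counts for a tweet ID file.

    Assumption: the file holds one numeric tweet ID per line.
    """
    valid = 0
    invalid = 0
    with open(id_path, encoding="utf-8") as f:
        for line in f:
            entry = line.strip()
            if not entry:
                continue  # ignore blank lines
            if entry.isascii() and entry.isdigit():
                valid += 1
            else:
                invalid += 1  # anything that is not a plain number
    return valid, invalid
```

For example, count_tweet_ids("twitter_a.txt") returns a pair of counts; if the second number is not zero, the file contains lines that are not plain numeric IDs.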

Once the completion bar is filled with green, the hydration is complete. The output file will contain various metadata for all of the tweets, and you can decide how you would like to utilise these. You will need to save your final file as .txt to use it with standard corpus analysis software.
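
One possible way to produce the .txt file is to extract only the tweet text from the hydrated .json file. The Python sketch below is a minimal example under two assumptions that you should check against your own output: that the hydrated file holds one JSON object per line, and that the tweet text is stored under a "full_text" or "text" field. The function name extract_tweet_text is our own.

```python
import json


def extract_tweet_text(hydrated_path, txt_path):
    """Write one tweet text per line to txt_path; return the number written.

    Assumptions: hydrated_path holds one JSON object per line, and the
    tweet text is stored under "full_text" or, failing that, "text".
    """
    count = 0
    with open(hydrated_path, encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            text = tweet.get("full_text") or tweet.get("text") or ""
            # Collapse line breaks and repeated spaces so that each
            # tweet occupies exactly one line of the output file.
            text = " ".join(text.split())
            if text:
                dst.write(text + "\n")
                count += 1
    return count
```

For example, extract_tweet_text("twitter_a.json", "twitter_a_text.txt") would write the text of each tweet to a new file; both file names here are placeholders. Choose an output name that differs from the tweet ID files so that they are not overwritten.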

The number of words stated in our corpus documentation reflects only the tweet text, and does not include any words from the metadata.
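
To get a rough word count for a text-only file, something like the following Python sketch could be used. It assumes a plain .txt file containing only tweet text and counts whitespace-separated tokens; this is not necessarily the tokenisation behind the figures in our corpus documentation, so small differences are to be expected. The function name count_words is our own.

```python
def count_words(txt_path):
    """Return the number of whitespace-separated tokens in a text file."""
    total = 0
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total
```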